Book

R commands in this chapter

|annotate| coord_polar| gather| grep| grepl| gsub| grexpr| gregexec| regexpr| regexec| reorder| rep| spread| sub|

We have already provided some rules to follow as we created plots for our examples. Here, we aim to provide some general principles we can use as a guide for effective data visualization. Much of this section is based on a talk by Karl Broman1 titled “Creating Effective Figures and Tables”2 and includes some of the figures which were made with code that Karl makes available on his GitHub repository3, as well as class notes from Peter Aldhous’ Introduction to Data Visualization course4. Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles. We compare and contrast plots that follow these principles to those that don’t.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach, it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables. As a final note, we want to emphasize that for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different than a chart intended to communicate a finding to a general audience.

We will be using these libraries:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(dslabs)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

11.1 Encoding data using visual cues

We start by describing some principles for encoding data. There are several approaches at our disposal including position, aligned lengths, angles, area, brightness, and color hue.

To illustrate how some of these strategies compare, let’s suppose we want to report the results from two hypothetical polls regarding browser preference taken in 2000 and then 2015. For each year, we are simply comparing five quantities – the five percentages. A widely used graphical representation of percentages, popularized by Microsoft Excel, is the pie chart:

browsers <- data.frame(Browser = rep(c("Opera", "Safari","Firefox",
                                       "IE", "Chrome"),
                                     2),
                       Year = rep(c(2000, 2015), each = 5),
                       Percentage = c(3,21,23,28,26,2,22,21,27,29)) |>
  mutate(Browser = reorder(Browser, Percentage))
head(browsers)
library(ggthemes)
browsers |>
  ggplot(aes(x = "", y= Percentage, fill = Browser)) +
  geom_bar(width = 1, stat = "identity", col = "black") +
  coord_polar(theta = "y") +
  xlab("") + ylab("") +
  theme(axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  facet_grid(. ~ Year) -> p1
p1

rep

  • times: an integer-valued vector giving the (non-negative) number of times to repeat each element if of length length(x), or to repeat the whole vector if of length 1. Negative or NA values are an error. A double vector is accepted, other inputs being coerced to an integer or double vector.
  • length.out: non-negative integer. The desired length of the output vector. Other inputs will be coerced to a double vector and the first element taken. Ignored if NA or invalid.
  • each: non-negative integer. Each element of x is repeated each times. Other inputs will be coerced to an integer or double vector and the first element taken. Treated as 1 if NA or invalid.

coord_polar

The polar coordinate system is most commonly used for pie charts, which are a stacked bar chart in polar coordinates.

coord_polar(theta = "x", start = 0, direction = 1, clip = "on")

  • theta: variable to map angle to (x or y)
  • start: Offset of starting point from 12 o’clock in radians. Offset is applied clockwise or anticlockwise depending on value of direction.
  • direction: 1, clockwise; -1, anticlockwise
  • clip: Should drawing be clipped to the extent of the plot panel? A setting of “on” (the default) means yes, and a setting of “off” means no.

Here we are representing quantities with both areas and angles, since both the angle and area of each pie slice are proportional to the quantity the slice represents. This turns out to be a sub-optimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when area is the only available visual cue. The donut chart is an example of a plot that uses only area:

browsers |> ggplot(aes(x = 2, y = Percentage, fill = Browser)) +
  geom_bar(width = 1, stat = "identity", col = "black")  + 
  scale_x_continuous(limits=c(0.5,2.5)) + 
  coord_polar(theta = "y") +
  xlab("") + ylab("") +
  theme(axis.text=element_blank(), 
        axis.ticks = element_blank(), 
        panel.grid  = element_blank()) +
  facet_grid(.~Year)

To see how hard it is to quantify angles and area, note that the rankings and all the percentages in the plots above changed from 2000 to 2015. Can you determine the actual percentages and rank the browsers’ popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot. In fact, the pie R function help file states that:
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.

In this case, simply showing the numbers is not only clearer, but would also save on printing costs if printing a paper copy:

browsers |>
  spread(Year, Percentage)

spread

Spread a key-value pair across multiple columns

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

  • key, value: <tidy-select> columns to use for key and value.
  • fill: If set, missing values will be replaced with this value. Note that there are two types of missingness in the input: explicit missing values (i.e. NA), and implicit missings, rows that simply aren’t present. Both types of missing value will be replaced by fill.
  • convert: If TRUE, type.convert() with asis = TRUE will be run on each of the new columns. This is useful if the value column was a mix of variables that was coerced to a string. If the class of the value column was factor or date, note that will not be true of the new columns that are produced, which are coerced to character before type conversion.
  • drop: If FALSE, will keep factor levels that don’t appear in the data, filling in missing combinations with fill.
  • sep: If NULL, the column names will be taken from the values of key variable. If non-NULL, the column names will be given by “<key_name><sep><key_value>”.

The preferred way to plot these quantities is to use length and position as visual cues, since humans are much better at judging linear measures. The barplot uses this approach by using bars of length proportional to the quantities of interest. By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we ease the visual burden of quantifying through the position of the top of the bars. Compare and contrast the information we can extract from the two figures.

browsers |>
  ggplot(aes(Browser, Percentage)) +
  geom_bar(stat = "identity", width = 0.5) +
  ylab("Percent using the Browser") +
  facet_grid(. ~ Year) -> p2
grid.arrange(p1, p2, nrow = 2)

Notice how much easier it is to see the differences in the barplot. In fact, we can now determine the actual percentages by following a horizontal line to the x-axis.

If for some reason you need to make a pie chart, label each pie slice with its respective percentage so viewers do not have to infer them from the angles or area:

library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
browsers <- filter(browsers, Year == 2015)
at <- with(browsers,
           100 - 
             cumsum(c(0,
                      Percentage[-length(Percentage)])) - 
                        0.5 * Percentage)
label <- percent(browsers$Percentage / 100)
at
## [1] 99.0 87.0 65.5 41.5 13.5
label
## [1] "2%"  "22%" "21%" "27%" "29%"
browsers |> 
  ggplot(aes(x = "", y = Percentage, fill = Browser)) +
  geom_bar(width = 1, stat = "identity", col = "black") +
  coord_polar(theta = "y") +
  xlab("") + ylab("") + ggtitle("2015") +
  theme(axis.text=element_blank(), 
        axis.ticks = element_blank(), 
        panel.grid  = element_blank()) +
  annotate(geom = "text", 
              x = 1.62, y =  at, 
              label = label, size=4)

annotate

This function adds geoms to a plot, but unlike a typical geom function, the properties of the geoms are not mapped from variables of a data frame, but are instead passed in as vectors. This is useful for adding small annotations (such as text labels) or if you have your data in vectors, and for some reason don’t want to put them in a data frame.

annotate( geom, x = NULL, y = NULL, xmin = NULL, xmax = NULL, ymin = NULL, ymax = NULL, xend = NULL, yend = NULL, …, na.rm = FALSE)

In general, when displaying quantities, position and length are preferred over angles and/or area. Brightness and color are even harder to quantify than angles. But, as we will see later, they are sometimes useful when more than two dimensions must be displayed at once.

11.2 Know when to include 0

When using barplots, it is misinformative not to start the bars at 0. This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed. By avoiding 0, relatively small differences can be made to look much bigger than they actually are. This approach is often used by politicians or media organizations trying to exaggerate a difference. Below is an illustrative example used by Peter Aldhous in this lecture: http://paldhous.github.io/ucb/2016/dataviz/week2.html.

(Source: Fox News, via Media Matters5.)

From the plot above, it appears that apprehensions have almost tripled when, in fact, they have only increased by about 16%. Starting the graph at 0 illustrates this clearly:

data.frame(Year = as.character(c(2011, 2012, 2013)),
           Southwest_Border_Apprehensions = c(165244,170223,192298)) |>
  ggplot(aes(Year, Southwest_Border_Apprehensions )) +
  geom_bar(stat = "identity", fill = "yellow", col = "black", width = 0.65)

Here is another example, described in detail in a Flowing Data blog post:

(Source: Fox News, via Flowing Data6.)

This plot makes a 13% increase look like a five fold change. Here is the appropriate plot:

data.frame(date = c("Now", "Jan 1, 2013"), tax_rate = c(35, 39.6)) |>
  mutate(date = reorder(date, tax_rate)) |>
  ggplot(aes(date, tax_rate)) + 
  ylab("") + xlab("") +
  geom_bar(stat = "identity", 
           fill = "yellow", 
           col = "black", width = 0.5) + 
  ggtitle("Top Tax Rate If Bush Tax Cut Expires")

Finally, here is an extreme example that makes a very small difference of under 2% look like a 10-100 fold change:

(Source: Venezolana de Televisión via Pakistan Today7 and Diego Mariano.)

data.frame(Candidate = factor(c("Maduro", "Capriles"),
                              levels = c("Maduro", "Capriles")),
           Percent = c(50.66, 49.07)) |>
  ggplot(aes(Candidate, Percent, fill = Candidate)) +
  geom_bar(stat = "identity", width = 0.65, show.legend = FALSE)

When using position rather than length, it is then not necessary to include 0. This is particularly the case when we want to compare differences between groups relative to the within-group variability. Here is an illustrative example showing country average life expectancy stratified across continents in 2012:

gapminder |>
  filter(year == 2012) |>
  ggplot(aes(continent, life_expectancy)) +
  geom_point() -> p1
p2 <- p1 + scale_y_continuous(limits = c(0, 84))
grid.arrange(p2, p1, ncol = 2)

Note that in the plot on the left, which includes 0, the space between 0 and 43 adds no information and makes it harder to compare the between and within group variability.

11.3 Do not distort quantities

During President Barack Obama’s 2011 State of the Union Address, the following chart was used to compare the US GDP to the GDP of four competing nations:

(Source: The 2011 State of the Union Address8)

Judging by the area of the circles, the US appears to have an economy over five times larger than China’s and over 30 times larger than France’s. However, if we look at the actual numbers, we see that this is not the case. The actual ratios are 2.6 and 5.8 times bigger than China and France, respectively. The reason for this distortion is that the radius, rather than the area, was made to be proportional to the quantity, which implies that the proportion between the areas is squared: 2.6 turns into 6.5 and 5.8 turns into 34.1. Here is a comparison of the circles we get if we make the value proportional to the radius and to the area:

gdp <- c(14.6, 5.7, 5.3, 3.3, 2.5)
gdp_data <- data.frame(Country = rep(c("United States", "China",
                                       "Japan", "Germany",
                                       "France"),
                                     2),
           y = factor(rep(c("Radius","Area"),each=5), 
                      levels = c("Radius", "Area")),
           GDP= c(gdp^2/min(gdp^2), gdp/min(gdp))) |> 
   mutate(Country = reorder(Country, GDP))
gdp_data |> 
  ggplot(aes(Country, y, size = GDP)) + 
  geom_point(show.legend = FALSE, color = "blue") + 
  scale_size(range = c(2,25)) +
  coord_flip() + ylab("") + xlab("")

Not surprisingly, ggplot2 defaults to using area rather than radius. Of course, in this case, we really should not be using area at all since we can use position and length:

gdp_data |>
  filter(y == "Area") |>
  ggplot(aes(Country, GDP)) +
  geom_bar(stat = "identity", width = 0.5) +
  ylab("GDP in trillions")

11.4 Order categories by a meaningful value

When one of the axes is used to show categories, as is done in barplots, the default ggplot2 behavior is to order the categories alphabetically when they are defined by character strings. If they are defined by factors, they are ordered by the factor levels. We rarely want to use alphabetical order. Instead, we should order by a meaningful quantity. In all the cases above, the barplots were ordered by the values being displayed. The exception was the graph showing barplots comparing browsers. In this case, we kept the order the same across the barplots to ease the comparison. Specifically, instead of ordering the browsers separately in the two years, we ordered both years by the average value of 2000 and 2015.

We previously learned how to use the reorder function, which helps us achieve this goal. To appreciate how the right order can help convey a message, suppose we want to create a plot to compare the murder rate across states. We are particularly interested in the most dangerous and safest states. Note the difference when we order alphabetically (the default) versus when we order by the actual rate:

data(murders)
murders |>
  mutate(murder_rate = total / population * 10^5) |>
  ggplot(aes(state, murder_rate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 8)) +
  xlab("") -> p1
murders |>
  mutate(murder_rate = total / population * 10^5) |>
  mutate(reorder(state, murder_rate)) |>
  ggplot(aes(state, murder_rate)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 8)) +
  xlab("") -> p2

grid.arrange(p1, p2, ncol = 2)  

The reorder function lets us reorder groups as well. Earlier we saw an example related to income distributions across regions. Here are the two versions plotted against each other:

past_year <- 1970
gapminder |>
  mutate(dollars_per_day = gdp / population / 365) |>
  filter(year == past_year & !is.na(gdp)) |>
  ggplot(aes(region, dollars_per_day)) +
  geom_boxplot() +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("") -> p1
gapminder |>
  mutate(dollars_per_day = gdp / population / 365) |>
  filter(year == past_year & !is.na(gdp)) |>
  mutate(region = reorder(region, dollars_per_day, FUN = median)) |>
  ggplot(aes(region, dollars_per_day)) +
  geom_boxplot() +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("") -> p2

grid.arrange(p1, p2, nrow = 1)

The first orders the regions alphabetically, while the second orders them by the group’s median.

11.5 Show the data

We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups.

To motivate our first principle, “show the data”, we go back to our artificial example of describing heights to ET, an extraterrestrial. This time let’s assume ET is interested in the difference in heights between males and females. A commonly seen plot used for comparisons between groups, popularized by software such as Microsoft Excel, is the dynamite plot, which shows the average and standard errors (standard errors are defined in a later chapter, but do not confuse them with the standard deviation of the data). The plot looks like this:

data(heights)
heights |>
  group_by(sex) |>
  summarise(average = mean(height), sd = sd(height)/sqrt(n())) |>
  ggplot(aes(sex, average)) +
  theme_excel() +
  geom_errorbar(aes(ymin = average - 2 * sd,
                    ymax = average + 2 * sd,
                    width = 0.25)) +
  geom_bar(stat = "identity",
           width = 0.5,
           fill = "burlywood",
           color = "black") +
  ylab("Height in inches") -> p1
p1

<span id=“geom_errorbar>

geom_errorbar

Various ways of representing a vertical interval defined by x, ymin and ymax. Each case draws a single graphical object.

  • mapping: Set of aesthetic mappings created by aes(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot. You must supply mapping if there is no plot mapping.

geom_errorbar( mapping = NULL, data = NULL, stat = “identity”, position = “identity”, …, na.rm = FALSE, orientation = NA, show.legend = NA, inherit.aes = TRUE )

The average of each group is represented by the top of each bar and the antennae extend out from the average to the average plus two standard errors. If all ET receives is this plot, he will have little information on what to expect if he meets a group of human males and females. The bars go to 0: does this mean there are tiny humans measuring less than one foot? Are all males taller than the tallest females? Is there a range of heights? ET can’t answer these questions since we have provided almost no information on the height distribution.

This brings us to our first principle: show the data. This simple ggplot2 code already generates a more informative plot than the barplot by simply showing all the data points:

heights |>
  ggplot(aes(sex, height)) +
  geom_point()

For example, this plot gives us an idea of the range of the data. However, this plot has limitations as well, since we can’t really see all the 238 and 812 points plotted for females and males, respectively, and many points are plotted on top of each other. As we have previously described, visualizing the distribution is much more informative. But before doing this, we point out two ways we can improve a plot showing all the points.

The first is to add jitter, which adds a small random shift to each point. In this case, adding horizontal jitter does not alter the interpretation, since the point heights do not change, but we minimize the number of points that fall on top of each other and, therefore, get a better visual sense of how the data is distributed. A second improvement comes from using alpha blending: making the points somewhat transparent. The more points fall on top of each other, the darker the plot, which also helps us get a sense of how the points are distributed. Here is the same plot with jitter and alpha blending:

heights |>
  ggplot(aes(sex, height)) +
  geom_jitter(width = 0.1, alpha = 0.2)

Now we start getting a sense that, on average, males are taller than females. We also note dark horizontal bands of points, demonstrating that many report values that are rounded to the nearest integer.

11.6 Ease comparisons

11.6.1 Use common axes

Since there are so many points, it is more effective to show distributions rather than individual points. We therefore show histograms for each group:

heights |>
  ggplot(aes(height, ..density..)) +
  geom_histogram(binwidth = 1, color = "black") +
  facet_grid(. ~ sex, scales = "free_x")
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

However, from this plot it is not immediately obvious that males are, on average, taller than females. We have to look carefully to notice that the x-axis has a higher range of values in the male histogram. An important principle here is to keep the axes the same when comparing data across two plots. Below we see how the comparison becomes easier:

heights |>
  ggplot(aes(height, ..density..)) +
  geom_histogram(binwidth = 1, color = "black") +
  facet_grid(. ~ sex)

11.6.2 Align plots vertically to see horizontal changes and horizontally to see vertical changes

In these histograms, the visual cue related to decreases or increases in height are shifts to the left or right, respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axes are fixed:

heights |>
  ggplot(aes(height, ..density..)) +
  geom_histogram(binwidth = 1, color = "black") +
  facet_grid(sex ~ .) -> p2
p2

This plot makes it much easier to notice that men are, on average, taller.

If , we want the more compact summary provided by boxplots, we then align them horizontally since, by default, boxplots move up and down with changes in height. Following our show the data principle, we then overlay all the data points:

heights |>
  ggplot(aes(sex, height)) +
  geom_boxplot(coef = 3) +
  geom_jitter(width = 0.1, alpha = 0.2) +
  ylab("Height in inches") -> p3
p3

coef: Length of the whiskers as multiple of IQR. Defaults to 1.5.

Now contrast and compare these three plots, based on exactly the same data:

grid.arrange(p1, p2, p3, ncol=3)

Notice how much more we learn from the two plots on the right. Barplots are useful for showing one number, but not very useful when we want to describe distributions.

11.6.3 Consider transformations

We have motivated the use of the log transformation in cases where the changes are multiplicative. Population size was an example in which we found a log transformation to yield a more informative transformation.

The combination of an incorrectly chosen barplot and a failure to use a log transformation when one is merited can be particularly distorting. As an example, consider this barplot showing the average population sizes for each continent in 2015:

data(gapminder)
gapminder |>
  filter(year == 2015) |>
  group_by(continent) |>
  summarise(population = mean(population)) |>
  mutate(continent = reorder(continent, population)) |>
  ggplot(aes(continent, population/10^6)) +
  geom_bar(stat = "identity", width = 0.5, fill = "blue") +
  theme_excel() +
  ylab("Population in Millions") +
  xlab("Continent") -> p1
p1

From this plot, one would conclude that countries in Asia are much more populous than in other continents. Following the show the data principle, we quickly notice that this is due to two very large countries, which we assume are India and China:

gapminder |> filter(year == 2015) |> 
  mutate(continent = reorder(continent, population, median)) |>
  ggplot(aes(continent, population/10^6)) + 
  ylab("Population in Millions") +
  xlab("Continent") -> p2
p2 + geom_jitter(width = .1, alpha = .5) 

Using a log transformation here provides a much more informative plot. We compare the original barplot to a boxplot using the log scale transformation for the y-axis:

p2 <- p2 +
  geom_boxplot(coef = 3) +
  geom_jitter(width = .1, alpha = .5) +
  scale_y_log10(breaks = c(1, 10, 100, 1000)) +
  theme(axis.text.x = element_text(size = 7))
grid.arrange(p1, p2, ncol = 2)

With the new plot, we realize that countries in Africa actually have a larger median population size than those in Asia.

Other transformations you should consider are the logistic transformation (logit), useful to better see fold changes in odds, and the square root transformation (sqrt), useful for count data.

11.6.4 Visual cues to be compared should be adjacent

For each continent, let’s compare income in 1970 versus 2010. When comparing income data across regions between 1970 and 2010, we made a figure similar to the one below, but this time we investigate continents rather than regions.

gapminder |>
  filter(year %in% c(1970, 2010) & !is.na(gdp)) |>
  mutate(dollars_per_day = gdp / population / 365) |>
  mutate(labels = paste(year, continent)) |>
  ggplot(aes(labels, dollars_per_day)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")

The default in ggplot2 is to order labels alphabetically so the labels with 1970 come before the labels with 2010, making the comparisons challenging because a continent’s distribution in 1970 is visually far from its distribution in 2010. It is much easier to make the comparison between 1970 and 2010 for each continent when the boxplots for that continent are next to each other:

gapminder |>
  filter(year %in% c(1970, 2010) & !is.na(gdp)) |>
  mutate(dollars_per_day = gdp / population / 365) |>
  mutate(labels = paste(continent, year)) |>
  ggplot(aes(labels, dollars_per_day)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")

11.6.5 Use color

The comparison becomes even easier to make if we use color to denote the two things we want to compare:

gapminder |>
  filter(year %in% c(1970, 2010) & !is.na(gdp)) |>
  mutate(dollars_per_day = gdp / population / 365,
         year = factor(year)) |>
  mutate(labels = paste(continent, year)) |>
  ggplot(aes(labels, dollars_per_day, fill = year)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")

11.7 Think of the color blind

About 10% of the population is color blind. Unfortunately, the default colors used in ggplot2 are not optimal for this group. However, ggplot2 does make it easy to change the color palette used in the plots. An example of how we can use a color blind friendly palette is described here: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette:

color_blind_friendly_cols <- 
  c("#999999", "#E69F00", "#56B4E9", "#009E73", 
    "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
data.frame(x = 1:8,
           y = rep(1, 8), col = as.character(1:8)) |>
  ggplot(aes(x, y, color = col)) +
  geom_point(size = 8, show.legend = FALSE) +
  theme(axis.title = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) -> 
  p1

p1 + scale_color_manual(values = color_blind_friendly_cols)

Here are the colors

There are several resources that can help you select colors, for example this one: http://bconnelly.net/2013/10/creating-colorblind-friendly-figures/.

Source: http://jfly.iam.u-tokyo.ac.jp/color/image/pallete.jpg

11.8 Plots for two variables

In general, you should use scatterplots to visualize the relationship between two variables. In every single instance in which we have examined the relationship between two variables, including total murders versus population size, life expectancy versus fertility rates, and infant mortality versus income, we have used scatterplots. This is the plot we generally recommend. However, there are some exceptions and we describe two alternative plots here: the slope chart and the Bland-Altman plot.

There is no geometry for slope charts in ggplot2, but we can construct one using geom_line. We need to do some tinkering to add labels. Below is an example comparing 2010 to 2015 for large western countries:

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder |>
  filter(year %in% c(2010, 2015) & 
         region %in% west &
         !is.na(life_expectancy) &
         population > 10^7) -> 
  dat

dat |>
  mutate(location = ifelse(year == 2010, 1, 2),
         location = ifelse(year == 2015 &
                             country %in% c("United Kingdom", "Portugal"),
                           location + 0.22,
                           location),
         hjust = ifelse(year == 2010, 1, 0)) |>
  mutate(year = as.factor(year)) |>
  ggplot(aes(year, 
             life_expectancy, 
             group = country)) +
  geom_line(aes(color = country), 
            show.legend = FALSE) +
  geom_text(aes(x = location, 
                label = country, 
                hjust = hjust),
            show.legend = FALSE) +
  xlab("") + ylab("Life Expectancy")

An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Although we are using angle as the visual cue, we also have position to determine the exact values. Comparing the improvements is a bit harder with a scatterplot:

library(ggrepel)
c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand") ->
  west

gapminder |> 
  filter(year%in% c(2010, 2015) & region %in% west & 
           !is.na(life_expectancy) & population > 10^7) ->
  dat

dat |> 
  mutate(year = paste0("life_expectancy_", year)) |>
  select(country, year, life_expectancy) |>
  spread(year, life_expectancy) |> 
  ggplot(aes(x=life_expectancy_2010,
             y=life_expectancy_2015, 
             label = country)) + 
  geom_point() + 
  geom_text_repel() +
  scale_x_continuous(limits=c(78.5, 83)) +
  scale_y_continuous(limits=c(78.5, 83)) +
  geom_abline(lty = 2) +
  xlab("2010") + 
  ylab("2015")

In the scatterplot, we have followed the principle use common axes since we are comparing these before and after. However, if we have many points, slope charts stop being useful as it becomes hard to see all the lines.

11.8.2 Bland-Altman plot

Since we are primarily interested in the difference, it makes sense to dedicate one of our axes to it. The Bland-Altman plot, also known as the Tukey mean-difference plot and the MA-plot, shows the difference versus the average:

library(ggrepel)
dat |> 
  mutate(year = paste0("life_expectancy_", year)) |>
  select(country, year, life_expectancy) |> 
  pivot_wider(names_from = "year", 
              values_from = "life_expectancy") |> 
  mutate(average = (life_expectancy_2015 + life_expectancy_2010)/2,
         difference = life_expectancy_2015 - life_expectancy_2010) |>
  ggplot(aes(average, 
             difference, 
             label = country)) + 
  geom_point() +
  geom_text_repel() +
  geom_abline(lty = 2) +
  xlab("Average of 2010 and 2015") + 
  ylab("Difference between 2015 and 2010")

NB paste0() = paste(..., sep = " ")

Here, by simply looking at the y-axis, we quickly see which countries have shown the most improvement. We also get an idea of the overall value from the x-axis.

11.9 Encoding a third variable

An earlier scatterplot showed the relationship between infant survival and average income. Below is a version of this plot that encodes three variables: OPEC membership, region, and population.

present_year <- 2010

gapminder |>
  mutate(region = case_when(
    region %in% west ~ "The West",
    region %in% "Northern Africa" ~ "Northern Africa",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region == "Southern Asia"~ "Southern Asia",
    region %in% c("Central America", "South America", "Caribbean") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    region %in% c("Melanesia", "Micronesia", "Polynesia") ~ "Pacific Islands"),
    dollars_per_day = gdp / population / 365) |>
  filter(year %in% present_year & 
         !is.na(gdp) & !is.na(infant_mortality) & 
         !is.na(region) ) |>
  mutate(OPEC = ifelse(country%in%opec, "Yes", "No")) ->
  dat

dat |>
  ggplot(aes(dollars_per_day,
             1 - infant_mortality / 1000,
             col = region,
             size = population / 10^6,
             pch = OPEC)) +
  scale_x_continuous(trans = "log2",
                     limits = c(0.25, 150)) +
  scale_y_continuous(trans = "logit",
                     limit = c(0.875, .9981),
                     breaks = c(.85, .90, .95, .99, .995, .998)) +
  geom_point(alpha = 0.5) +
  ylab("Infant survival proportion")

We encode categorical variables with color and shape. These shapes can be controlled with shape argument. Below are the shapes available for use in R. For the last five, the color goes inside.

For continuous variables, we can use color, intensity, or size. We now show an example of how we do this with a case study.

When selecting colors to quantify a numeric variable, we choose between two options: sequential and diverging. Sequential colors are suited for data that goes from high to low. High values are clearly distinguished from low values. Here are some examples offered by the package RColorBrewer:

library(RColorBrewer)
display.brewer.all(type="seq")

Diverging colors are used to represent values that diverge from a center. We put equal emphasis on both ends of the data range: higher than the center and lower than the center. An example of when we would use a divergent pattern would be if we were to show height in standard deviations away from the average. Here are some examples of divergent patterns:

display.brewer.all(type = "div")

11.10 Avoid psuedo-three-dimensional plots

The figure below, taken from the scientific literature38, shows three variables: dose, drug type and survival. Although your screen/book page is flat and two-dimensional, the plot tries to imitate three dimensions and assigned a dimension to each variable.

<img src=“https://raw.githubusercontent.com/kbroman/Talk_Graphs/master/Figs/fig8b.png />

Humans are not good at seeing in three dimensions (which explains why it is hard to parallel park) and our limitation is even worse with regard to pseudo-three-dimensions. To see this, try to determine the values of the survival variable in the plot above. Can you tell when the purple ribbon intersects the red one? This is an example in which we can easily use color to represent the categorical variable instead of using a pseudo-3D:

url <- "https://github.com/kbroman/Talk_Graphs/raw/master/R/fig8dat.csv"
dat <- read.csv(url)
dat |> 
  gather(drug, survival, -log.dose) |>
  mutate(drug = gsub("Drug.","",drug)) |>
  ggplot(aes(log.dose, survival, color = drug)) +
  geom_line()   

gather

Gather columns into key-value pairs

[Superseded]

Development on gather() is complete, and for new code we recommend switching to pivot_longer(), which is easier to use, more featureful, and still under active development. df %>% gather("key", "value", x, y, z) is equivalent to df %>% pivot_longer(c(x, y, z), names_to = "key", values_to = "value")

gather( data, key = “key”, value = “value”, …, na.rm = FALSE, convert = FALSE, factor_key = FALSE )

grep

  • grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) returns a vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE)
  • grepl(pattern, x, ...) returns a logical vector (match or not for each element of x).
  • sub(pattern, replacement, x, ...) and gsub(pattern, replacement, x, ...) return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding).
  • regexpr(pattern, text) returns an integer vector of the same length as text giving the starting position of the first match or -1−1 if there is none, with attribute “match.length”, an integer vector giving the length of the matched text (or -1−1 for no match).
  • gregexpr(pattern, text) returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
  • regexec(pattern, text) returns a list of the same length as text each element of which is either -1−1 if there is no match, or a sequence of integers with the starting positions of the match and all substrings corresponding to parenthesized subexpressions of pattern, with attribute match.length a vector giving the lengths of the matches (or -1−1 for no match). The interpretation of positions and length and the attributes follows regexpr.
  • gregexec(pattern, text) returns the same as regexec, except that to accommodate multiple matches per element of text, the integer sequences for each match are made into columns of a matrix, with one matrix per element of text with matches.

Notice how much easier it is to determine the survival values.

Pseudo-3D is sometimes used completely gratuitously: plots are made to look 3D even when the 3rd dimension does not represent a quantity. This only adds confusion and makes it harder to relay your message. Here are two examples:

11.11 Avoid too many significant digits


  1. http://kbroman.org/↩︎

  2. https://www.biostat.wisc.edu/~kbroman/presentations/graphs2017.pdf↩︎

  3. https://github.com/kbroman/Talk_Graphs↩︎

  4. http://paldhous.github.io/ucb/2016/dataviz/index.html↩︎

  5. http://mediamatters.org/blog/2013/04/05/fox-news-newest-dishonest-chart-immigration-enf/193507↩︎

  6. http://flowingdata.com/2012/08/06/fox-news-continues-charting-excellence/↩︎

  7. https://www.pakistantoday.com.pk/2018/05/18/whats-at-stake-in-venezuelan-presidential-vote↩︎

  8. https://www.youtube.com/watch?v=kl2g40GoRxg↩︎